Previously, we discovered a number of microbial proteins with intriguing structural similarity to human immune-effector molecules. Experimentally determined protein structures, alongside AI-predicted structures (AlphaFold and ESMFold), were used as queries and searched against multiple databases of predicted microbial protein structures, namely the AlphaFold-UniProt50 and ESM-MGnify databases.
Ultimately, our goal is to explore the prevalence and potential function of these proteins in the human microbiome. To pare down the list of candidate proteins to study, we will apply a filter for human-microbiome relevance.
Objectives
In this analysis we will determine which genes with structural homology to human proteins of interest are present in the human microbiome.
To accomplish this, we will access metagenome-assembled genomes (MAGs) and cultured isolate representative species genomes from human gut microbiome samples. We will then search these genomes for the presence of our genes of interest.
Core questions:
Out of the 4,783 tested target proteins of interest, how many are detected in the human microbiome at all?
At the Ensembl ID level, how many of the target proteins are detected in the human microbiome?
What is the distribution of detection for each Foldseek target protein (i.e., how many times is each protein detected within and across genomes)?
Are there Ensembl IDs with widespread detection across multiple species? If so, which species have the most hits?
Are there representative species that carry multiple proteins of interest? If so, are there trends at higher-level clades (e.g., the phylum level)?
Is detection of target proteins correlated with genome contamination?
Which of the representative species within UHGG are cultured isolates vs. MAGs? Are there differences in detection between the two groups?
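Several of the questions above reduce to counting DIAMOND hits per query and per genome. A minimal sketch in R, assuming a tabular hit file in DIAMOND's `--outfmt 6` layout (the `diamond_hits.tsv` path and the rule for parsing a genome ID out of `sseqid` are placeholder assumptions, not the project's actual files):

```r
# Hypothetical summary of DIAMOND tabular output (--outfmt 6 default columns).
library(dplyr)
library(readr)
library(stringr)

hits <- read_tsv(
  "diamond_hits.tsv",  # placeholder path
  col_names = c(
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore"
  )
)

hit_summary <- hits %>%
  # assumed ID scheme: genome prefix before the first underscore;
  # adapt to the actual sseqid format of each database
  mutate(genome = str_extract(sseqid, "^[^_]+")) %>%
  group_by(qseqid) %>%
  summarise(
    n_hits    = n(),                  # total detections per query
    n_genomes = n_distinct(genome)    # breadth of detection across genomes
  ) %>%
  arrange(desc(n_genomes))

# queries detected at all in the human microbiome
n_detected <- n_distinct(hits$qseqid)
```

`n_genomes` directly answers the breadth question; joining `genome` to a UHGG taxonomy table would extend this to the species- and phylum-level questions.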
Figure 1: Schematic of data architecture
Analysis
Workflow Overview
```mermaid
flowchart TD
A(Quality Filtered \n Foldseek Hits) --> |Select top 20 Targets\nper Query-DB by LDDT| B(Microbial Mimetic \nQueries \n n=4,783 PDBs)
B(Microbial Mimetic \nQueries \n n=4,783 PDBs) --> G{DIAMOND\nProtein Search}
C[(UHGG Species \n n=4,744 genomes)] --> G{DIAMOND\nProtein Search}
style C fill:#C8D1F7
E[(UHGP-95 \n Protein Catalog \n 20+ million seqs)] --> G{DIAMOND\nProtein Search}
style E fill:#C8D1F7
F[(hCom2 \n Synthetic Human Gut \n Microbiome \n n=200 genomes)] --> G{DIAMOND\nProtein Search}
style F fill:#C8D1F7
G{DIAMOND\nProtein Search} --> D(Human Microbiome \n Microbial Mimetic \n Candidates)
```
1. UHGG Representative Species Genomes
Code
```r
# download representative species proteomes
future::plan(multisession, workers = 2)
species_reps %>%
  keep(!file.exists(glue(
    "{homedir}/Downloads/RefDBs/mgnify_genomes/human-gut/v2.0.1/",
    "{.}.faa"
  ))) %>%
  furrr::future_walk(
    ~ shell_do(glue(
      "wget -P",
      " /central/groups/MazmanianLab/joeB/Downloads/",
      "RefDBs/mgnify_genomes/human-gut/v2.0.1/",
      " https://www.ebi.ac.uk/metagenomics/api/v1/genomes/",
      "{.}/downloads/{.}.faa"
    ))
  )

# download UHGG phylogenetic trees
shell_do(glue(
  "wget -P {wkdir}/data/input/mgnify/",
  " http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.1/phylogenies/ar122_iqtree.nwk"
))
shell_do(glue(
  "wget -P {wkdir}/data/input/mgnify/",
  " http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.1/phylogenies/bac120_iqtree.nwk"
))

# compile all representative species proteomes into a single file
genome_paths <- list.files(
  glue("{mgnify_dir}/human-gut/v2.0.1"),
  pattern = "\\.faa$", full.names = TRUE
)
pb <- progress_bar$new(
  total = length(genome_paths),
  format = "[:bar] :current/:total (:percent)"
)
for (f in genome_paths) {
  pb$tick()
  shell_do(glue("cat {f} >> {mgnify_dir}/human-gut_v2.0.1.fasta"))
}

# create a DIAMOND DB
dmnd_db <- glue("{homedir}/Downloads/RefDBs/diamondDBs/mgnify_human-gut_v2.0.1")
dmnd_cmd <- glue(
  "diamond makedb",
  " --in {mgnify_dir}/human-gut_v2.0.1.fasta",
  " -d {dmnd_db}"
)
slurm_shell_do(dmnd_cmd, walltime = 7200, memory = "10G")
```
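The `DIAMOND Protein Search` step itself (the central node of the workflow) is not shown in the chunk above. A minimal sketch, reusing `dmnd_db` from above and assuming a query FASTA of the 4,783 mimetic protein sequences (the query and output paths, e-value cutoff, and target count are placeholders, not the values actually used):

```r
# Sketch: align mimetic query proteins against the UHGG DIAMOND DB.
# Query FASTA path, output path, and thresholds are assumptions.
dmnd_search_cmd <- glue(
  "diamond blastp",
  " --query {wkdir}/data/input/mimetic_queries.faa",
  " --db {dmnd_db}",
  " --out {wkdir}/data/output/uhgg_mimetic_hits.tsv",
  " --outfmt 6",
  " --evalue 1e-5",
  " --max-target-seqs 25"
)
slurm_shell_do(dmnd_search_cmd, walltime = 7200, memory = "10G")
```

The same command pattern applies to the UHGP-95 and hCom2 databases built below, swapping in the corresponding `--db` path.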
2. UHGP-95 Protein Catalog
Code
```r
# create a DIAMOND DB for the UHGP-95 protein catalog
dmnd_db_uhgp95 <- glue("{homedir}/Downloads/RefDBs/diamondDBs/uhgp-95_v2.0.1")
dmnd_cmd <- glue(
  "diamond makedb",
  " --in {protein_catalogs}/UHGP_v2.0.1/uhgp-95/uhgp-95.faa",
  " -d {dmnd_db_uhgp95}"
)
slurm_shell_do(dmnd_cmd, walltime = 36000, memory = "100G")
```
3. hCom2 Synthetic Human Gut Microbiome
Here we gather the hCom2 genomes and predict proteins for each individual strain using Prodigal. We then concatenate all of the protein FASTA files into a single file and create a DIAMOND DB.
Code
```r
# initiate future.batchtools backend for parallel processing
future::plan(
  future.batchtools::batchtools_slurm,
  template = glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),
  resources = list(
    name = "prodigal-HCom2",
    memory = "5G",
    ncpus = 1,
    walltime = 3600
  )
)

# select input genome fasta file paths for prodigal
hcom2_dir <- glue("{homedir}/Downloads/RefDBs/hCom2")
hcom2_proteins_dir <- glue("{hcom2_dir}/protein_fastas")
dir.create(hcom2_proteins_dir, showWarnings = FALSE)

hcom2_genome_paths <- list.files(
  path = glue(
    "/central/groups/MazmanianLab/shared/",
    "reference_genomes/hCom2_transfer/fasta"
  ),
  full.names = TRUE
)

# keep only genomes whose protein fasta has not been generated yet
hcom2_unprocessed_lgl <- hcom2_genome_paths %>%
  purrr::map(~ !file.exists(
    glue("{hcom2_proteins_dir}/{fs::path_ext_remove(basename(.))}.faa")
  ))
hcom2_genome_paths_to_process <- hcom2_genome_paths %>%
  keep(unlist(hcom2_unprocessed_lgl))

# generate list of desired fasta outputs
hcom2_proteins_paths <- purrr::map(
  hcom2_genome_paths_to_process,
  ~ glue("{hcom2_proteins_dir}/{fs::path_ext_remove(basename(.))}.faa")
)

# chunk inputs (~5 genomes per job) and run prodigal on each chunk
tic()
n_jobs <- ceiling(length(hcom2_genome_paths_to_process) / 5)
prodigal_runs <- listenv()
for (job in seq_len(n_jobs)) {
  input_list <- chunk_func(hcom2_genome_paths_to_process, n_jobs)[[job]]
  output_list <- chunk_func(hcom2_proteins_paths, n_jobs)[[job]]
  prodigal_runs[[job]] %<-% shell_do_prodigal_list(
    input_genome_list = input_list,
    output_fasta_list = output_list
  )
}
toc()

hCom2_faa_files <- list.files(hcom2_proteins_dir, full.names = TRUE)
hCom2_faa_files %>% length()

# concatenate all fasta files into a single file
pb <- progress_bar$new(
  total = length(hCom2_faa_files),
  format = "[:bar] :current/:total (:percent)"
)
for (f in hCom2_faa_files) {
  pb$tick()
  shell_do(glue("cat {f} >> {hcom2_dir}/hCom2.fasta"))
}

# create a DIAMOND DB out of hCom2 proteomes
dmnd_db_hcom2 <- glue("{homedir}/Downloads/RefDBs/diamondDBs/hCom2")
dmnd_cmd <- glue(
  "diamond makedb",
  " --in {hcom2_dir}/hCom2.fasta",
  " -d {dmnd_db_hcom2}"
)
slurm_shell_do(dmnd_cmd, walltime = 7200, memory = "10G")
```
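`shell_do_prodigal_list()` is a helper defined elsewhere in the project; a minimal sketch of an equivalent is below, assuming `shell_do()` executes a shell command string. The helper body is a reconstruction, though Prodigal's `-i`, `-a`, and `-p` flags are standard:

```r
# Hypothetical reconstruction of the prodigal helper: for each genome,
# predict genes and write translated proteins to the matching .faa path.
shell_do_prodigal_list <- function(input_genome_list, output_fasta_list) {
  purrr::walk2(
    input_genome_list, output_fasta_list,
    ~ shell_do(glue(
      "prodigal",
      " -i {.x}",   # input genome fasta
      " -a {.y}",   # output translated protein fasta
      " -p single"  # single-genome mode, appropriate for isolates/MAGs
    ))
  )
}
```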